The blood sample data comes from patients in the Wuhan region of China and was collected between 10 January and 18 February 2020. The original goal of collecting it was to help identify crucial predictive biomarkers of disease mortality. More information can be found in the article by Tan et al.
The analysis first looks at what we know about the patients. It then tries to determine whether a patient's death can be predicted from the available sample results.
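Since the dataset contains a multitude of biomarkers, the classification focuses on three of them, which Tan et al. pointed out in their article:
- lymphocytes (`x_lymphocyte`)
- lactate dehydrogenase (`lactate_dehydrogenase`)
- high-sensitivity C-reactive protein (`high_sensitivity_c_reactive_protein`)

Tan et al. were able to use them to predict the mortality of individual patients more than 10 days in advance with more than 90% accuracy.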
In the original dataset, aside from basic information about the patient, each row contains the timestamp of a blood sample and the results for a select few biomarkers. The other biomarkers in that row are empty. For some parts of the analysis the dataset therefore has to be adjusted to make up for this. We assume that, if a row has no value for a biomarker, the closest approximation is the most recent value of that biomarker in the patient's earlier samples. If no earlier sample contains the biomarker, the first later value is taken. If the patient has no value for the biomarker at all, the median over the whole dataset is used. Please bear this in mind, as the approach is probably not ideal and can skew some of the results of the analysis.
Fourteen samples, each belonging to a different patient, are missing a registration date; each is the only sample for its patient. For these samples the patient's admission time is used as the registration date.
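The three-step fallback described above can be sketched as follows. This is a minimal illustration in plain Python, not the code behind this report: the function name `impute_biomarker`, the toy patients, and the choice to compute the dataset median from the observed (pre-imputation) values are all assumptions.

```python
from statistics import median


def impute_biomarker(samples, global_median):
    """Fill gaps in one patient's chronologically ordered values (None = missing)."""
    filled = list(samples)

    # Step 1: carry the most recent past value forward.
    last_seen = None
    for i, value in enumerate(filled):
        if value is not None:
            last_seen = value
        elif last_seen is not None:
            filled[i] = last_seen

    # Step 2: gaps before the first observation take the first future value.
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is not None:
            next_seen = filled[i]
        elif next_seen is not None:
            filled[i] = next_seen

    # Step 3: the patient has no value at all, so use the dataset-wide median.
    return [global_median if value is None else value for value in filled]


# Hypothetical values of a single biomarker for three patients.
patients = {
    "a": [None, 250.0, None, 310.0],
    "b": [None, None],
    "c": [400.0, None],
}

# Assumption: the median is taken over observed values only.
observed = [v for values in patients.values() for v in values if v is not None]
global_median = median(observed)

cleaned = {pid: impute_biomarker(values, global_median) for pid, values in patients.items()}
print(cleaned)
```

Patient `b` shows why the caveat matters: every one of that patient's values is replaced by a number that says nothing about them individually.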
The original dataset contains 81 columns and 6120 rows. Each row corresponds to a blood sample result. The data concerns 375 patients. For each patient there are multiple sample results.
Let’s take a look at the information we have about the patients.
| patient_id | age | gender | admission_time | discharge_time | death | hospitalized_days |
|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. :18.00 | male :224 | Min. :2020-01-10 15:52:20 | Min. :2020-01-23 09:09:23 | no :201 | Min. : 0.0847 |
| 1st Qu.: 94.5 | 1st Qu.:46.00 | female:151 | 1st Qu.:2020-02-01 19:27:40 | 1st Qu.:2020-02-11 13:39:21 | yes:174 | 1st Qu.: 4.4845 |
| Median :188.0 | Median :62.00 | | Median :2020-02-04 22:30:34 | Median :2020-02-16 17:40:07 | | Median : 9.5942 |
| Mean :188.0 | Mean :58.83 | | Mean :2020-02-04 20:13:51 | Mean :2020-02-15 16:42:59 | | Mean :10.8536 |
| 3rd Qu.:281.5 | 3rd Qu.:70.00 | | 3rd Qu.:2020-02-10 04:11:10 | 3rd Qu.:2020-02-19 11:47:14 | | 3rd Qu.:15.6876 |
| Max. :375.0 | Max. :95.00 | | Max. :2020-02-17 21:30:07 | Max. :2020-03-04 16:21:51 | | Max. :35.1708 |
According to this data, male patients are more likely to die. Please note that the dataset contains fewer female patients (151) than male patients (224).
Older patients appear to be more likely to die.
A significant number of patients die shortly after being hospitalized. Most likely, these are patients who were admitted in critical condition.
And here’s a short summary of all of the available attributes (patient info and biomarkers) before the cleaning of the dataset for further analysis:
| patient_id | re_date | age | gender | admission_time | discharge_time | death | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time | procalcitonin | eosinophils | interleukin_2_receptor | alkaline_phosphatase | albumin | basophil | interleukin_10 | total_bilirubin | platelet_count | monocytes | antithrombin | interleukin_8 | indirect_bilirubin | red_blood_cell_distribution_width | neutrophils | total_protein | quantification_of_treponema_pallidum_antibodies | prothrombin_activity | h_bs_ag | mean_corpuscular_volume | hematocrit | white_blood_cell_count | tumor_necrosis_factor_u_03b1 | mean_corpuscular_hemoglobin_concentration | fibrinogen | interleukin_1ss | urea | lymphocyte_count | ph_value | red_blood_cell_count | eosinophil_count | corrected_calcium | serum_potassium | glucose | neutrophils_count | direct_bilirubin | mean_platelet_volume | ferritin | rbc_distribution_width_sd | thrombin_time | x_lymphocyte | hcv_antibody_quantification | d_d_dimer | total_cholesterol | aspartate_aminotransferase | uric_acid | hco3 | calcium | amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp | lactate_dehydrogenase | platelet_large_cell_ratio | interleukin_6 | fibrin_degradation_products | monocytes_count | plt_distribution_width | globulin | x_u_03b3_glutamyl_transpeptidase | international_standard_ratio | basophil_count | x2019_n_co_v_nucleic_acid_detection | mean_corpuscular_hemoglobin | activation_of_partial_thromboplastin_time | high_sensitivity_c_reactive_protein | hiv_antibody_quantification | serum_sodium | thrombocytocrit | esr | glutamic_pyruvic_transaminase | e_gfr | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. :2020-01-10 19:45:00 | Min. :18.00 | Min. :1.000 | Min. :2020-01-10 15:52:20 | Min. :2020-01-23 09:09:23 | Min. :0.0000 | Min. : 1.9 | Min. : 6.4 | Min. : 71.50 | Min. : 11.50 | Min. : 0.020 | Min. :0.000 | Min. : 61.0 | Min. : 17.00 | Min. :13.60 | Min. :0.00 | Min. : 5.00 | Min. : 2.50 | Min. : -1.0 | Min. : 0.300 | Min. : 20.00 | Min. : 5.000 | Min. : 0.100 | Min. :10.60 | Min. : 1.7 | Min. :31.80 | Min. : 0.020 | Min. : 6.00 | Min. : 0.000 | Min. : 61.60 | Min. :14.50 | Min. : 0.13 | Min. : 4.00 | Min. :286.0 | Min. : 0.500 | Min. : 5.00 | Min. : 0.800 | Min. : 0.000 | Min. :5.000 | Min. : 0.100 | Min. :0.000 | Min. :1.650 | Min. : 2.760 | Min. : 1.000 | Min. : 0.06 | Min. : 1.600 | Min. : 8.50 | Min. : 17.8 | Min. : 31.30 | Min. : 13.00 | Min. : 0.000 | Min. :0.020 | Min. : 0.210 | Min. :0.100 | Min. : 6.00 | Min. : 43.0 | Min. : 6.30 | Min. :1.170 | Min. : 5 | Min. : 110.0 | Min. :11.20 | Min. : 1.500 | Min. : 4.00 | Min. : 0.010 | Min. : 8.00 | Min. :10.10 | Min. : 3.00 | Min. : 0.840 | Min. :0.000 | Min. :-1 | Min. :20.4 | Min. : 21.80 | Min. : 0.10 | Min. :0.05 | Min. :115.4 | Min. :0.010 | Min. : 1.00 | Min. : 5.00 | Min. : 2.00 | Min. : 11.00 | |
| 1st Qu.: 92.0 | 1st Qu.:2020-02-04 13:46:00 | 1st Qu.:47.00 | 1st Qu.:1.000 | 1st Qu.:2020-02-01 00:06:16 | 1st Qu.:2020-02-13 19:06:26 | 1st Qu.:0.0000 | 1st Qu.: 4.4 | 1st Qu.:113.0 | 1st Qu.: 99.05 | 1st Qu.: 13.60 | 1st Qu.: 0.040 | 1st Qu.:0.000 | 1st Qu.: 459.5 | 1st Qu.: 54.00 | 1st Qu.:27.40 | 1st Qu.:0.10 | 1st Qu.: 5.00 | 1st Qu.: 7.40 | 1st Qu.:109.0 | 1st Qu.: 2.800 | 1st Qu.: 74.00 | 1st Qu.: 8.675 | 1st Qu.: 3.800 | 1st Qu.:12.00 | 1st Qu.:65.1 | 1st Qu.:61.00 | 1st Qu.: 0.040 | 1st Qu.: 65.00 | 1st Qu.: 0.000 | 1st Qu.: 86.90 | 1st Qu.:33.50 | 1st Qu.: 4.94 | 1st Qu.: 6.70 | 1st Qu.:333.0 | 1st Qu.: 3.050 | 1st Qu.: 5.00 | 1st Qu.: 4.000 | 1st Qu.: 0.460 | 1st Qu.:6.000 | 1st Qu.: 3.680 | 1st Qu.:0.000 | 1st Qu.:2.270 | 1st Qu.: 3.950 | 1st Qu.: 5.550 | 1st Qu.: 3.09 | 1st Qu.: 3.225 | 1st Qu.:10.10 | 1st Qu.: 377.2 | 1st Qu.: 38.50 | 1st Qu.: 15.60 | 1st Qu.: 3.925 | 1st Qu.:0.040 | 1st Qu.: 0.603 | 1st Qu.:3.010 | 1st Qu.: 19.50 | 1st Qu.: 183.2 | 1st Qu.:21.00 | 1st Qu.:1.980 | 1st Qu.: 150 | 1st Qu.: 218.0 | 1st Qu.:25.60 | 1st Qu.: 4.772 | 1st Qu.: 4.00 | 1st Qu.: 0.270 | 1st Qu.:11.10 | 1st Qu.:29.70 | 1st Qu.: 22.00 | 1st Qu.: 1.030 | 1st Qu.:0.010 | 1st Qu.:-1 | 1st Qu.:29.7 | 1st Qu.: 35.30 | 1st Qu.: 5.70 | 1st Qu.:0.07 | 1st Qu.:137.7 | 1st Qu.:0.150 | 1st Qu.: 14.00 | 1st Qu.: 16.00 | 1st Qu.: 63.58 | 1st Qu.: 58.00 | |
| Median :185.0 | Median :2020-02-09 12:50:00 | Median :62.00 | Median :1.000 | Median :2020-02-04 15:53:12 | Median :2020-02-17 21:50:30 | Median :0.0000 | Median : 20.6 | Median :125.0 | Median :102.10 | Median : 14.80 | Median : 0.100 | Median :0.100 | Median : 676.5 | Median : 69.50 | Median :32.20 | Median :0.20 | Median : 5.90 | Median : 10.70 | Median :178.0 | Median : 5.700 | Median : 86.00 | Median : 16.000 | Median : 5.400 | Median :12.60 | Median :82.4 | Median :65.90 | Median : 0.050 | Median : 81.00 | Median : 0.010 | Median : 90.10 | Median :36.60 | Median : 7.72 | Median : 8.60 | Median :343.0 | Median : 4.120 | Median : 5.00 | Median : 5.985 | Median : 0.800 | Median :6.500 | Median : 4.140 | Median :0.010 | Median :2.360 | Median : 4.410 | Median : 6.990 | Median : 5.85 | Median : 4.800 | Median :10.80 | Median : 711.0 | Median : 40.90 | Median : 16.80 | Median :11.450 | Median :0.060 | Median : 2.155 | Median :3.630 | Median : 27.00 | Median : 243.7 | Median :23.50 | Median :2.080 | Median : 585 | Median : 340.0 | Median :30.90 | Median : 19.265 | Median : 17.90 | Median : 0.410 | Median :12.40 | Median :32.70 | Median : 34.00 | Median : 1.140 | Median :0.010 | Median :-1 | Median :30.9 | Median : 39.20 | Median : 51.50 | Median :0.09 | Median :140.4 | Median :0.210 | Median : 28.00 | Median : 24.00 | Median : 87.90 | Median : 76.00 | |
| Mean :184.8 | Mean :2020-02-08 07:09:59 | Mean :59.44 | Mean :1.391 | Mean :2020-02-03 18:57:56 | Mean :2020-02-16 21:40:09 | Mean :0.4747 | Mean : 1223.2 | Mean :123.1 | Mean :103.14 | Mean : 16.68 | Mean : 1.107 | Mean :0.629 | Mean : 907.2 | Mean : 82.47 | Mean :32.01 | Mean :0.21 | Mean : 16.07 | Mean : 16.70 | Mean :184.3 | Mean : 6.155 | Mean : 85.32 | Mean : 83.088 | Mean : 6.889 | Mean :13.07 | Mean :77.6 | Mean :65.30 | Mean : 0.132 | Mean : 78.55 | Mean : 8.306 | Mean : 90.39 | Mean :36.55 | Mean : 15.60 | Mean : 11.58 | Mean :342.8 | Mean : 4.294 | Mean : 6.51 | Mean : 9.589 | Mean : 1.017 | Mean :6.484 | Mean : 9.288 | Mean :0.039 | Mean :2.355 | Mean : 4.509 | Mean : 8.889 | Mean : 7.81 | Mean : 9.887 | Mean :10.91 | Mean : 1379.1 | Mean : 42.44 | Mean : 18.17 | Mean :15.392 | Mean :0.117 | Mean : 7.943 | Mean :3.689 | Mean : 46.53 | Mean : 276.1 | Mean :23.14 | Mean :2.078 | Mean : 3669 | Mean : 474.2 | Mean :31.77 | Mean : 112.308 | Mean : 61.35 | Mean : 0.526 | Mean :13.01 | Mean :33.24 | Mean : 55.34 | Mean : 1.313 | Mean :0.017 | Mean :-1 | Mean :31.0 | Mean : 41.52 | Mean : 76.24 | Mean :0.10 | Mean :141.6 | Mean :0.212 | Mean : 33.69 | Mean : 38.86 | Mean : 81.56 | Mean : 109.93 | |
| 3rd Qu.:270.0 | 3rd Qu.:2020-02-13 10:36:00 | 3rd Qu.:71.00 | 3rd Qu.:2.000 | 3rd Qu.:2020-02-09 02:06:58 | 3rd Qu.:2020-02-19 13:30:26 | 3rd Qu.:1.0000 | 3rd Qu.: 223.8 | 3rd Qu.:137.0 | 3rd Qu.:105.65 | 3rd Qu.: 16.70 | 3rd Qu.: 0.405 | 3rd Qu.:0.800 | 3rd Qu.:1155.5 | 3rd Qu.: 95.00 | 3rd Qu.:36.60 | 3rd Qu.:0.30 | 3rd Qu.: 12.35 | 3rd Qu.: 16.77 | 3rd Qu.:248.0 | 3rd Qu.: 8.600 | 3rd Qu.: 97.00 | 3rd Qu.: 35.200 | 3rd Qu.: 8.000 | 3rd Qu.:13.70 | 3rd Qu.:92.3 | 3rd Qu.:70.45 | 3rd Qu.: 0.070 | 3rd Qu.: 95.00 | 3rd Qu.: 0.010 | 3rd Qu.: 93.90 | 3rd Qu.:39.90 | 3rd Qu.: 12.72 | 3rd Qu.: 11.50 | 3rd Qu.:350.0 | 3rd Qu.: 5.480 | 3rd Qu.: 5.00 | 3rd Qu.:11.400 | 3rd Qu.: 1.310 | 3rd Qu.:7.294 | 3rd Qu.: 4.650 | 3rd Qu.:0.060 | 3rd Qu.:2.440 | 3rd Qu.: 4.870 | 3rd Qu.:10.260 | 3rd Qu.:10.95 | 3rd Qu.: 8.275 | 3rd Qu.:11.50 | 3rd Qu.: 1425.2 | 3rd Qu.: 44.70 | 3rd Qu.: 18.38 | 3rd Qu.:24.975 | 3rd Qu.:0.090 | 3rd Qu.:21.000 | 3rd Qu.:4.265 | 3rd Qu.: 42.00 | 3rd Qu.: 333.8 | 3rd Qu.:25.90 | 3rd Qu.:2.190 | 3rd Qu.: 2625 | 3rd Qu.: 601.8 | 3rd Qu.:37.20 | 3rd Qu.: 60.167 | 3rd Qu.:150.00 | 3rd Qu.: 0.580 | 3rd Qu.:14.30 | 3rd Qu.:36.50 | 3rd Qu.: 58.00 | 3rd Qu.: 1.330 | 3rd Qu.:0.020 | 3rd Qu.:-1 | 3rd Qu.:32.2 | 3rd Qu.: 44.12 | 3rd Qu.:118.50 | 3rd Qu.:0.11 | 3rd Qu.:143.5 | 3rd Qu.:0.270 | 3rd Qu.: 45.50 | 3rd Qu.: 41.00 | 3rd Qu.:103.97 | 3rd Qu.: 98.25 | |
| Max. :375.0 | Max. :2020-02-18 17:49:00 | Max. :95.00 | Max. :2.000 | Max. :2020-02-17 21:30:07 | Max. :2020-03-04 16:21:51 | Max. :1.0000 | Max. :50000.0 | Max. :178.0 | Max. :140.40 | Max. :120.00 | Max. :57.170 | Max. :8.600 | Max. :7500.0 | Max. :620.00 | Max. :48.60 | Max. :1.70 | Max. :1000.00 | Max. :505.70 | Max. :558.0 | Max. :53.000 | Max. :136.00 | Max. :6795.000 | Max. :145.100 | Max. :27.10 | Max. :98.9 | Max. :88.70 | Max. :11.950 | Max. :142.00 | Max. :250.000 | Max. :118.90 | Max. :52.30 | Max. :1726.60 | Max. :168.00 | Max. :514.0 | Max. :10.780 | Max. :88.50 | Max. :68.400 | Max. :52.420 | Max. :7.565 | Max. :749.500 | Max. :0.490 | Max. :2.790 | Max. :12.800 | Max. :43.010 | Max. :33.88 | Max. :360.600 | Max. :15.00 | Max. :50000.0 | Max. :113.30 | Max. :161.90 | Max. :60.000 | Max. :2.090 | Max. :60.000 | Max. :7.300 | Max. :1858.00 | Max. :1176.0 | Max. :36.30 | Max. :2.620 | Max. :70000 | Max. :1867.0 | Max. :62.20 | Max. :5000.000 | Max. :190.80 | Max. :39.920 | Max. :25.30 | Max. :50.60 | Max. :732.00 | Max. :13.480 | Max. :0.120 | Max. :-1 | Max. :50.8 | Max. :144.00 | Max. :320.00 | Max. :0.27 | Max. :179.7 | Max. :0.510 | Max. :110.00 | Max. :1600.00 | Max. :224.00 | Max. :1497.00 | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :5613 | NA’s :5145 | NA’s :5145 | NA’s :5458 | NA’s :5661 | NA’s :5163 | NA’s :5852 | NA’s :5190 | NA’s :5186 | NA’s :5163 | NA’s :5853 | NA’s :5190 | NA’s :5163 | NA’s :5162 | NA’s :5790 | NA’s :5852 | NA’s :5214 | NA’s :5197 | NA’s :5163 | NA’s :5189 | NA’s :5841 | NA’s :5461 | NA’s :5841 | NA’s :5163 | NA’s :5163 | NA’s :4993 | NA’s :5852 | NA’s :5163 | NA’s :5554 | NA’s :5852 | NA’s :5184 | NA’s :5163 | NA’s :5736 | NA’s :4993 | NA’s :5163 | NA’s :5206 | NA’s :5140 | NA’s :5345 | NA’s :5163 | NA’s :5190 | NA’s :5258 | NA’s :5837 | NA’s :5197 | NA’s :5554 | NA’s :5162 | NA’s :5841 | NA’s :5490 | NA’s :5189 | NA’s :5185 | NA’s :5186 | NA’s :5186 | NA’s :5141 | NA’s :5645 | NA’s :5186 | NA’s :5258 | NA’s :5848 | NA’s :5790 | NA’s :5163 | NA’s :5258 | NA’s :5190 | NA’s :5190 | NA’s :5461 | NA’s :5163 | NA’s :5619 | NA’s :5163 | NA’s :5552 | NA’s :5383 | NA’s :5842 | NA’s :5145 | NA’s :5258 | NA’s :5737 | NA’s :5189 | NA’s :5184 | NA’s :5184 |
After the cleaning described in the introduction:
| patient_id | re_date | age | gender | admission_time | discharge_time | death | hypersensitive_cardiac_troponin_i | hemoglobin | serum_chloride | prothrombin_time | procalcitonin | eosinophils | interleukin_2_receptor | alkaline_phosphatase | albumin | basophil | interleukin_10 | total_bilirubin | platelet_count | monocytes | antithrombin | interleukin_8 | indirect_bilirubin | red_blood_cell_distribution_width | neutrophils | total_protein | quantification_of_treponema_pallidum_antibodies | prothrombin_activity | h_bs_ag | mean_corpuscular_volume | hematocrit | white_blood_cell_count | tumor_necrosis_factor_u_03b1 | mean_corpuscular_hemoglobin_concentration | fibrinogen | interleukin_1ss | urea | lymphocyte_count | ph_value | red_blood_cell_count | eosinophil_count | corrected_calcium | serum_potassium | glucose | neutrophils_count | direct_bilirubin | mean_platelet_volume | ferritin | rbc_distribution_width_sd | thrombin_time | x_lymphocyte | hcv_antibody_quantification | d_d_dimer | total_cholesterol | aspartate_aminotransferase | uric_acid | hco3 | calcium | amino_terminal_brain_natriuretic_peptide_precursor_nt_pro_bnp | lactate_dehydrogenase | platelet_large_cell_ratio | interleukin_6 | fibrin_degradation_products | monocytes_count | plt_distribution_width | globulin | x_u_03b3_glutamyl_transpeptidase | international_standard_ratio | basophil_count | x2019_n_co_v_nucleic_acid_detection | mean_corpuscular_hemoglobin | activation_of_partial_thromboplastin_time | high_sensitivity_c_reactive_protein | hiv_antibody_quantification | serum_sodium | thrombocytocrit | esr | glutamic_pyruvic_transaminase | e_gfr | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. :2020-01-10 19:45:00 | Min. :18.00 | Min. :1.000 | Min. :2020-01-10 15:52:20 | Min. :2020-01-23 09:09:23 | Min. :0.0000 | Min. : 1.9 | Min. : 6.4 | Min. : 71.5 | Min. : 11.50 | Min. : 0.0200 | Min. :0.0000 | Min. : 61.0 | Min. : 17.00 | Min. :13.60 | Min. :0.0000 | Min. : 5.00 | Min. : 2.50 | Min. : -1.0 | Min. : 0.300 | Min. : 20.00 | Min. : 5.00 | Min. : 0.100 | Min. :10.60 | Min. : 1.70 | Min. :31.80 | Min. : 0.0200 | Min. : 6.00 | Min. : 0.00 | Min. : 61.60 | Min. :14.50 | Min. : 0.13 | Min. : 4.0 | Min. :286.0 | Min. : 0.500 | Min. : 5.000 | Min. : 0.800 | Min. : 0.0000 | Min. :5.000 | Min. : 0.100 | Min. :0.00000 | Min. :1.650 | Min. : 2.760 | Min. : 1.000 | Min. : 0.060 | Min. : 1.600 | Min. : 8.50 | Min. : 17.8 | Min. : 31.30 | Min. : 13.00 | Min. : 0.00 | Min. :0.02000 | Min. : 0.210 | Min. :0.10 | Min. : 6.00 | Min. : 43.0 | Min. : 6.30 | Min. :1.170 | Min. : 5 | Min. : 110 | Min. :11.20 | Min. : 1.50 | Min. : 4.00 | Min. : 0.0100 | Min. : 8.00 | Min. :10.10 | Min. : 3.00 | Min. : 0.840 | Min. :0.00000 | Min. :-1 | Min. :20.40 | Min. : 21.80 | Min. : 0.10 | Min. :0.05000 | Min. :115.4 | Min. :0.0100 | Min. : 1.00 | Min. : 5.0 | Min. : 2.00 | Min. : 11.0 | |
| 1st Qu.: 92.0 | 1st Qu.:2020-02-04 13:46:00 | 1st Qu.:47.00 | 1st Qu.:1.000 | 1st Qu.:2020-02-01 00:06:16 | 1st Qu.:2020-02-13 19:06:26 | 1st Qu.:0.0000 | 1st Qu.: 3.7 | 1st Qu.:114.0 | 1st Qu.: 98.8 | 1st Qu.: 13.50 | 1st Qu.: 0.0400 | 1st Qu.:0.0000 | 1st Qu.: 585.0 | 1st Qu.: 54.00 | 1st Qu.:28.40 | 1st Qu.:0.1000 | 1st Qu.: 5.00 | 1st Qu.: 7.20 | 1st Qu.:121.0 | 1st Qu.: 3.100 | 1st Qu.: 84.00 | 1st Qu.: 12.60 | 1st Qu.: 3.600 | 1st Qu.:11.90 | 1st Qu.:65.10 | 1st Qu.:62.20 | 1st Qu.: 0.0400 | 1st Qu.: 70.00 | 1st Qu.: 0.00 | 1st Qu.: 86.80 | 1st Qu.:33.80 | 1st Qu.: 4.84 | 1st Qu.: 7.7 | 1st Qu.:334.0 | 1st Qu.: 3.400 | 1st Qu.: 5.000 | 1st Qu.: 3.840 | 1st Qu.: 0.5000 | 1st Qu.:6.000 | 1st Qu.: 3.710 | 1st Qu.:0.00000 | 1st Qu.:2.270 | 1st Qu.: 3.920 | 1st Qu.: 5.630 | 1st Qu.: 2.980 | 1st Qu.: 3.200 | 1st Qu.:10.20 | 1st Qu.: 582.5 | 1st Qu.: 38.40 | 1st Qu.: 15.80 | 1st Qu.: 4.50 | 1st Qu.:0.05000 | 1st Qu.: 0.510 | 1st Qu.:2.97 | 1st Qu.: 20.00 | 1st Qu.: 185.0 | 1st Qu.:21.00 | 1st Qu.:2.000 | 1st Qu.: 111 | 1st Qu.: 226 | 1st Qu.:26.60 | 1st Qu.: 13.33 | 1st Qu.: 4.70 | 1st Qu.: 0.2800 | 1st Qu.:11.30 | 1st Qu.:30.10 | 1st Qu.: 21.00 | 1st Qu.: 1.030 | 1st Qu.:0.01000 | 1st Qu.:-1 | 1st Qu.:29.70 | 1st Qu.: 36.40 | 1st Qu.: 8.70 | 1st Qu.:0.08000 | 1st Qu.:137.3 | 1st Qu.:0.1400 | 1st Qu.: 18.00 | 1st Qu.: 15.0 | 1st Qu.: 66.80 | 1st Qu.: 58.0 | |
| Median :185.0 | Median :2020-02-09 12:50:00 | Median :62.00 | Median :1.000 | Median :2020-02-04 15:53:12 | Median :2020-02-17 21:50:30 | Median :0.0000 | Median : 12.9 | Median :126.0 | Median :101.7 | Median : 14.30 | Median : 0.1000 | Median :0.1000 | Median : 778.0 | Median : 68.00 | Median :33.05 | Median :0.2000 | Median : 7.50 | Median : 10.30 | Median :180.0 | Median : 6.000 | Median : 88.00 | Median : 17.10 | Median : 5.300 | Median :12.50 | Median :80.90 | Median :66.70 | Median : 0.0500 | Median : 86.00 | Median : 0.01 | Median : 89.80 | Median :36.90 | Median : 7.33 | Median : 8.7 | Median :343.0 | Median : 4.410 | Median : 5.000 | Median : 5.600 | Median : 0.7900 | Median :6.500 | Median : 4.160 | Median :0.01000 | Median :2.360 | Median : 4.330 | Median : 6.960 | Median : 5.420 | Median : 4.600 | Median :10.80 | Median : 826.8 | Median : 40.60 | Median : 16.70 | Median :12.40 | Median :0.06000 | Median : 1.350 | Median :3.59 | Median : 28.50 | Median : 243.4 | Median :23.20 | Median :2.100 | Median : 332 | Median : 338 | Median :31.40 | Median : 25.36 | Median : 7.40 | Median : 0.4000 | Median :12.60 | Median :33.10 | Median : 33.00 | Median : 1.100 | Median :0.01000 | Median :-1 | Median :30.90 | Median : 39.40 | Median : 51.90 | Median :0.09000 | Median :140.1 | Median :0.2000 | Median : 31.00 | Median : 23.0 | Median : 89.20 | Median : 76.0 | |
| Mean :184.8 | Mean :2020-02-08 07:09:59 | Mean :59.44 | Mean :1.391 | Mean :2020-02-03 18:57:56 | Mean :2020-02-16 21:40:09 | Mean :0.4747 | Mean : 800.5 | Mean :125.1 | Mean :102.3 | Mean : 15.51 | Mean : 0.6811 | Mean :0.5653 | Mean : 910.1 | Mean : 80.71 | Mean :32.72 | Mean :0.2012 | Mean : 15.11 | Mean : 15.74 | Mean :187.6 | Mean : 6.357 | Mean : 88.11 | Mean : 41.74 | Mean : 6.789 | Mean :12.99 | Mean :77.12 | Mean :66.24 | Mean : 0.1453 | Mean : 82.62 | Mean : 4.91 | Mean : 90.04 | Mean :36.78 | Mean : 12.39 | Mean : 10.7 | Mean :343.4 | Mean : 4.476 | Mean : 5.947 | Mean : 8.359 | Mean : 0.9744 | Mean :6.434 | Mean : 7.756 | Mean :0.03407 | Mean :2.351 | Mean : 4.408 | Mean : 8.643 | Mean : 7.429 | Mean : 8.983 | Mean :10.98 | Mean : 1288.9 | Mean : 42.05 | Mean : 17.76 | Mean :15.73 | Mean :0.09682 | Mean : 6.297 | Mean :3.65 | Mean : 41.89 | Mean : 271.4 | Mean :23.01 | Mean :2.094 | Mean : 1920 | Mean : 453 | Mean :32.32 | Mean : 73.12 | Mean : 36.54 | Mean : 0.4894 | Mean :13.19 | Mean :33.50 | Mean : 54.76 | Mean : 1.235 | Mean :0.01592 | Mean :-1 | Mean :30.93 | Mean : 40.58 | Mean : 75.75 | Mean :0.09457 | Mean :140.6 | Mean :0.2063 | Mean : 34.55 | Mean : 34.5 | Mean : 83.74 | Mean : 99.1 | |
| 3rd Qu.:270.0 | 3rd Qu.:2020-02-13 10:36:00 | 3rd Qu.:71.00 | 3rd Qu.:2.000 | 3rd Qu.:2020-02-09 02:06:58 | 3rd Qu.:2020-02-19 13:30:26 | 3rd Qu.:1.0000 | 3rd Qu.: 38.6 | 3rd Qu.:138.0 | 3rd Qu.:104.6 | 3rd Qu.: 15.80 | 3rd Qu.: 0.3100 | 3rd Qu.:0.7000 | 3rd Qu.:1026.0 | 3rd Qu.: 91.00 | 3rd Qu.:37.30 | 3rd Qu.:0.3000 | 3rd Qu.: 9.90 | 3rd Qu.: 15.40 | 3rd Qu.:245.0 | 3rd Qu.: 8.800 | 3rd Qu.: 92.00 | 3rd Qu.: 27.10 | 3rd Qu.: 7.700 | 3rd Qu.:13.50 | 3rd Qu.:91.60 | 3rd Qu.:70.80 | 3rd Qu.: 0.0600 | 3rd Qu.: 96.00 | 3rd Qu.: 0.01 | 3rd Qu.: 93.40 | 3rd Qu.:40.10 | 3rd Qu.: 12.15 | 3rd Qu.: 10.4 | 3rd Qu.:351.0 | 3rd Qu.: 5.410 | 3rd Qu.: 5.000 | 3rd Qu.: 9.625 | 3rd Qu.: 1.2800 | 3rd Qu.:6.500 | 3rd Qu.: 4.603 | 3rd Qu.:0.05000 | 3rd Qu.:2.430 | 3rd Qu.: 4.780 | 3rd Qu.: 9.780 | 3rd Qu.:10.450 | 3rd Qu.: 7.500 | 3rd Qu.:11.60 | 3rd Qu.: 1185.9 | 3rd Qu.: 44.10 | 3rd Qu.: 17.90 | 3rd Qu.:24.60 | 3rd Qu.:0.08000 | 3rd Qu.:11.610 | 3rd Qu.:4.21 | 3rd Qu.: 43.00 | 3rd Qu.: 328.0 | 3rd Qu.:25.50 | 3rd Qu.:2.190 | 3rd Qu.: 843 | 3rd Qu.: 574 | 3rd Qu.:37.60 | 3rd Qu.: 46.28 | 3rd Qu.: 25.80 | 3rd Qu.: 0.5800 | 3rd Qu.:14.60 | 3rd Qu.:36.52 | 3rd Qu.: 57.00 | 3rd Qu.: 1.250 | 3rd Qu.:0.02000 | 3rd Qu.:-1 | 3rd Qu.:32.10 | 3rd Qu.: 43.40 | 3rd Qu.:118.10 | 3rd Qu.:0.10000 | 3rd Qu.:142.7 | 3rd Qu.:0.2600 | 3rd Qu.: 43.00 | 3rd Qu.: 38.0 | 3rd Qu.:105.00 | 3rd Qu.: 97.0 | |
| Max. :375.0 | Max. :2020-02-18 17:49:00 | Max. :95.00 | Max. :2.000 | Max. :2020-02-17 21:30:07 | Max. :2020-03-04 16:21:51 | Max. :1.0000 | Max. :50000.0 | Max. :178.0 | Max. :140.4 | Max. :120.00 | Max. :57.1700 | Max. :8.6000 | Max. :7500.0 | Max. :620.00 | Max. :48.60 | Max. :1.7000 | Max. :1000.00 | Max. :505.70 | Max. :558.0 | Max. :53.000 | Max. :136.00 | Max. :6795.00 | Max. :145.100 | Max. :27.10 | Max. :98.90 | Max. :88.70 | Max. :11.9500 | Max. :142.00 | Max. :250.00 | Max. :118.90 | Max. :52.30 | Max. :1726.60 | Max. :168.0 | Max. :514.0 | Max. :10.780 | Max. :88.500 | Max. :68.400 | Max. :52.4200 | Max. :7.565 | Max. :749.500 | Max. :0.49000 | Max. :2.790 | Max. :12.800 | Max. :43.010 | Max. :33.880 | Max. :360.600 | Max. :15.00 | Max. :50000.0 | Max. :113.30 | Max. :161.90 | Max. :60.00 | Max. :2.09000 | Max. :60.000 | Max. :7.30 | Max. :1858.00 | Max. :1176.0 | Max. :36.30 | Max. :2.620 | Max. :70000 | Max. :1867 | Max. :62.20 | Max. :5000.00 | Max. :190.80 | Max. :39.9200 | Max. :25.30 | Max. :50.60 | Max. :732.00 | Max. :13.480 | Max. :0.12000 | Max. :-1 | Max. :50.80 | Max. :144.00 | Max. :320.00 | Max. :0.27000 | Max. :179.7 | Max. :0.5100 | Max. :110.00 | Max. :1600.0 | Max. :224.00 | Max. :1497.0 |
Age appears to be positively correlated with death from the disease.
A slight negative correlation between the length of the hospital stay and death can be seen. This is consistent with what has been noted in the overview of the data.
The plot below is limited to attributes with a correlation (positive or negative) of at least 0.7 in absolute value; otherwise it would become unreadable. The names have been abbreviated for the same reason. Please refer to the table below for the original attribute names.
A negative correlation can be seen between death and the lymphocyte sample results.
| abbreviated_name | original_name |
|---|---|
| death | death |
| hemglbn | hemoglobin |
| srm_chl | serum_chloride |
| prthrmbn_t | prothrombin_time |
| esnphls | eosinophils |
| albumin | albumin |
| ttl_blr | total_bilirubin |
| pltlt_c | platelet_count |
| moncyts | monocytes |
| indrct_ | indirect_bilirubin |
| rd_b___ | red_blood_cell_distribution_width |
| ntrphls | neutrophils |
| mn_crpsclr_v | mean_corpuscular_volume |
| hemtcrt | hematocrit |
| urea | urea |
| lymphc_ | lymphocyte_count |
| esnphl_ | eosinophil_count |
| ntrphl_ | neutrophils_count |
| drct_bl | direct_bilirubin |
| mn_plt_ | mean_platelet_volume |
| rbc_d__ | rbc_distribution_width_sd |
| x_lymph | x_lymphocyte |
| d_d_dmr | d_d_dimer |
| asprtt_ | aspartate_aminotransferase |
| calcium | calcium |
| pltl___ | platelet_large_cell_ratio |
| fbrn_d_ | fibrin_degradation_products |
| mncyts_ | monocytes_count |
| plt_ds_ | plt_distribution_width |
| intrn__ | international_standard_ratio |
| mn_crpsclr_h | mean_corpuscular_hemoglobin |
| srm_sdm | serum_sodium |
| thrmbcy | thrombocytocrit |
| gltmc__ | glutamic_pyruvic_transaminase |
| e_gfr | e_gfr |
| creatnn | creatinine |
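
The filtering step behind the plot can be sketched as below. This is an illustration under assumptions, not the code used for the report: it assumes Pearson correlation, assumes an attribute is kept when it reaches the threshold with at least one other attribute, and uses made-up values. The helper names `pearson` and `strongly_correlated` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def strongly_correlated(columns, threshold=0.7):
    """Names of attributes whose |r| reaches the threshold with some other attribute."""
    keep = set()
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(pearson(a, b)) >= threshold:
            keep.update((name_a, name_b))
    return sorted(keep)


# Made-up values for five patients; only the column names come from the dataset.
columns = {
    "death": [0, 0, 1, 1, 1],
    "x_lymphocyte": [30.0, 25.0, 8.0, 5.0, 3.0],
    "serum_potassium": [4.2, 4.6, 4.1, 4.7, 4.4],
}

print(strongly_correlated(columns))
```

In this toy data only `death` and `x_lymphocyte` survive the filter, with a strongly negative coefficient between them.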
The classification attempt uses the random forest learning method. For this purpose the dataset is transformed into the following per-patient form:
| age | gender | death | x_lymphocyte_min | x_lymphocyte_max | lactate_dehydrogenase_min | lactate_dehydrogenase_max | high_sensitivity_c_reactive_protein_min | high_sensitivity_c_reactive_protein_max |
|---|---|---|---|---|---|---|---|---|
| Min. :18.00 | Min. :1.000 | no :201 | Min. : 0.00 | Min. : 0.00 | Min. : 110.0 | Min. : 119.0 | Min. : 0.10 | Min. : 0.10 |
| 1st Qu.:46.00 | 1st Qu.:1.000 | yes:174 | 1st Qu.: 4.60 | 1st Qu.: 6.65 | 1st Qu.: 231.0 | 1st Qu.: 248.5 | 1st Qu.: 10.35 | 1st Qu.: 16.95 |
| Median :62.00 | Median :1.000 | | Median :12.40 | Median :14.10 | Median : 338.0 | Median : 340.0 | Median : 51.70 | Median : 53.10 |
| Mean :58.83 | Mean :1.403 | | Mean :15.30 | Mean :17.47 | Mean : 420.6 | Mean : 481.4 | Mean : 64.34 | Mean : 81.79 |
| 3rd Qu.:70.00 | 3rd Qu.:2.000 | | 3rd Qu.:23.65 | 3rd Qu.:25.90 | 3rd Qu.: 518.5 | 3rd Qu.: 594.0 | 3rd Qu.: 97.10 | 3rd Qu.:131.15 |
| Max. :95.00 | Max. :2.000 | | Max. :55.00 | Max. :60.00 | Max. :1867.0 | Max. :1867.0 | Max. :320.00 | Max. :320.00 |
For each patient, aside from their age and gender, the minimum and maximum values (observed during their hospital stay) of `x_lymphocyte`, `lactate_dehydrogenase`, and `high_sensitivity_c_reactive_protein` are considered. As mentioned in the introduction of this analysis, the choice of biomarkers is based on the article by Tan et al.
Only samples taken during the 5-day window after each patient's admission (63% of all samples) were selected for learning. The purpose is to simulate classifying a patient who has already been hospitalized for a few days. Most patients are hospitalized for more than 5 days, so a prediction made at that point could still be useful.
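The windowing and per-patient aggregation can be sketched as follows. This is a plain-Python illustration with invented timestamps and values, not the code behind the report; the function name `window_min_max` and the inclusive window bounds are assumptions.

```python
from datetime import datetime, timedelta


def window_min_max(admission_time, samples, days=5):
    """Min and max of one biomarker within `days` of admission.

    `samples` is a list of (timestamp, value) pairs for a single patient.
    Returns None when no sample falls inside the window.
    """
    cutoff = admission_time + timedelta(days=days)
    # Assumption: both ends of the window are inclusive.
    values = [
        value
        for taken_at, value in samples
        if value is not None and admission_time <= taken_at <= cutoff
    ]
    if not values:
        return None
    return min(values), max(values)


# Invented lactate dehydrogenase readings for one patient.
admission = datetime(2020, 2, 1, 8, 0)
ldh_samples = [
    (datetime(2020, 2, 1, 10, 0), 420.0),
    (datetime(2020, 2, 4, 9, 0), 515.0),
    (datetime(2020, 2, 9, 9, 0), 880.0),  # day 8: outside the window
]

ldh_range = window_min_max(admission, ldh_samples)
print(ldh_range)
```

The late reading of 880.0 is ignored, which is the point of the window: the model only sees what would be known a few days into the stay.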
70% of the dataset is used as training data and the remaining 30% as test data.
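A class-stratified split reproduces the sample counts reported in the output below. The sketch is an assumption about how the split could have been done, not a record of it: the function `stratified_split`, the seed, and rounding each class's training share up are all illustrative choices. With 201 "no" and 174 "yes" patients, rounding up gives 141 + 122 = 263 training samples and leaves 60 "no" and 52 "yes" patients for testing, which matches the model summary and the reference totals of the confusion matrix.

```python
import math
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices so each class contributes the same share to the training set."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: the per-class training count is rounded up.
        n_train = math.ceil(train_fraction * len(indices))
        train.extend(indices[:n_train])

    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


# Class sizes taken from the patient summary: 201 survivors, 174 deaths.
labels = ["no"] * 201 + ["yes"] * 174
train_idx, test_idx = stratified_split(labels)

test_counts = {c: sum(labels[i] == c for i in test_idx) for c in ("no", "yes")}
print(len(train_idx), len(test_idx), test_counts)
```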
```
## Random Forest
##
## 263 samples
##   8 predictor
##   2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 236, 237, 237, 237, 236, 237, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    2    0.8797090  0.7581506
##    3    0.8728734  0.7443456
##    4    0.8785368  0.7558239
##    5    0.8730993  0.7446359
##    6    0.8777717  0.7543753
##    7    0.8775438  0.7541035
##    8    0.8774298  0.7532245
##    9    0.8768315  0.7525498
##   10    0.8889967  0.7771683
##   11    0.8722731  0.7435351
##   12    0.8875722  0.7743328
##   13    0.8775458  0.7532014
##   14    0.8829263  0.7649736
##   15    0.8685674  0.7362131
##   16    0.8836691  0.7663287
##   17    0.8843244  0.7676563
##   18    0.8776313  0.7540705
##   19    0.8767745  0.7526481
##   20    0.8738970  0.7467661
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
```
```
## Confusion Matrix and Statistics
##
##           Reference
## Prediction no yes
##        no  52   0
##        yes  8  52
##
##                Accuracy : 0.9286
##                  95% CI : (0.8641, 0.9687)
##     No Information Rate : 0.5357
##     P-Value [Acc > NIR] : < 2e-16
##
##                   Kappa : 0.8579
##
##  Mcnemar's Test P-Value : 0.01333
##
##               Precision : 1.0000
##                  Recall : 0.8667
##                      F1 : 0.9286
##              Prevalence : 0.5357
##          Detection Rate : 0.4643
##    Detection Prevalence : 0.4643
##       Balanced Accuracy : 0.9333
##
##        'Positive' Class : no
##
```
```
## rf variable importance
##
##                                          Overall
## lactate_dehydrogenase_max                66.8163
## age                                      19.3024
## x_lymphocyte_min                         13.3378
## x_lymphocyte_max                         12.8294
## high_sensitivity_c_reactive_protein_min   8.0129
## high_sensitivity_c_reactive_protein_max   6.0812
## lactate_dehydrogenase_min                 3.8810
## gender                                    0.1193
```
To evaluate the above model, we need to know what we want to achieve with this classification.
Let's assume that we want to find out whether a patient will die so that we can try to save them before it happens. The death class value "no" is the positive class, and we do not need to be too concerned about such patients. The death class value "yes" is the negative class, and these patients need special care. In this case, the precision of 1 achieved by the model on the test data is a good sign: every test patient predicted to survive did survive, so no patient who died was overlooked.
The recall of 0.87 could be an issue during a pandemic. It means that 8 of the 60 surviving test patients were classified as likely to die. These false negatives are a potential drawback because the medical system is already strained: flagging patients for immediate care when they do not need it could waste precious resources.
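The figures discussed above can be re-derived from the four cells of the confusion matrix, treating "no" as the positive class. The helper `binary_metrics` is an illustrative name; the formulas are the standard definitions, and the counts are those of the test set.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics computed from the cells of a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Positive class "no" (survived):
#   tp = predicted no,  actually no  -> 52
#   fp = predicted no,  actually yes -> 0
#   fn = predicted yes, actually no  -> 8
#   tn = predicted yes, actually yes -> 52
m = binary_metrics(tp=52, fp=0, fn=8, tn=52)

for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Swapping the positive class to "yes" would turn the same matrix into a recall of 1 and a precision of 52/60, so which number looks worrying depends entirely on which class is called positive.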
The elderly are more likely to die.
High values of lactate dehydrogenase measured within 5 days of admission are associated with a patient's death. So are low lymphocyte values. Among the biomarkers mentioned by Tan et al., these two were the most important in the model.